attention pattern
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and a dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and sliding window attention patterns more generally, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D). This has called for custom kernels, which have so far been limited in functionality, performance, or both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention.
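To make the window size and dilation parameters concrete, here is a minimal NumPy sketch of 1-D neighborhood attention written as ordinary masked attention. It is a dense O(n^2) reference for the attention pattern only, not the batched GEMM formulation or the kernels the abstract describes. The function names are hypothetical. The boundary handling (shifting the window so it stays in bounds) and the dilation handling (each token attends only to tokens sharing its index modulo the dilation) are assumptions, since the abstract does not spell out either convention.

```python
import numpy as np

def neighborhood_mask_1d(n, window, dilation=1):
    """Boolean (n, n) mask; row i marks the tokens that token i may attend to.

    Assumed convention: tokens are split into `dilation` groups by index
    modulo `dilation`, and each token attends to `window` consecutive members
    of its own group, with the window shifted to stay in bounds at the edges.
    """
    if window < 1 or dilation < 1 or window > n // dilation:
        raise ValueError("need 1 <= window <= n // dilation and dilation >= 1")
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        group = i % dilation                                  # dilation group of token i
        pos = i // dilation                                   # position inside that group
        group_len = (n - group + dilation - 1) // dilation    # tokens in this group
        # Centre the window on the token, then clamp it to the group bounds.
        start = min(max(pos - window // 2, 0), group_len - window)
        cols = group + dilation * (start + np.arange(window))
        mask[i, cols] = True
    return mask

def neighborhood_attention_1d(q, k, v, window, dilation=1):
    """Single-head attention over (n, d) arrays, restricted by the mask above."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    mask = neighborhood_mask_1d(n, window, dilation)
    scores = np.where(mask, scores, -np.inf)          # drop non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

Under these assumptions the two ends of the spectrum are easy to check: with `window=1` every token attends only to itself, so the output equals `v` (a plain linear projection of the input once `v` is one), and with `window=n` and `dilation=1` the mask is all ones and the result matches standard self attention.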